Submission Number: 16345
Title: Spectral Perturbation Bounds for Low-Rank Approximation with Applications to Privacy

We provide four Python scripts that reproduce every empirical result in Section 4 and Appendix B.

--------------------------------
Dependencies: 
- Python 3.8+
- Pandas 2.2.2
- NumPy 1.26.4
- SciPy 1.13.0
- Matplotlib 3.9.0

--------------------------------
Execution Instructions:

1. Section 4: Perturbation bounds under Gaussian and Rademacher noise
(*) Files: 
-- "Perturbation_bounds_Gaussian_noise_of_low_rank_approximations.py"
-- "Perturbation_bounds_Rademacher_noise_of_low_rank_approximations.py"

(*) Datasets:
-- The US Census dataset can be downloaded from the UCI repository at https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)

-- The KDD Cup dataset can be downloaded from the UCI repository at https://archive.ics.uci.edu/dataset/129/kdd+cup+1998+data.

-- The Adult dataset can be downloaded from the UCI repository at https://archive.ics.uci.edu/ml/datasets/adult, (or from https://www.kaggle.com/datasets/wenruliu/adult-income-dataset in .csv form)

(*) Overview (Gaussian case) -- same procedure for Rademacher noise: 
(i) Preprocess the dataset (as described in Section 4) and compute the covariance matrix A. 
This step yields (A = Census, n = 69), (A = KDD-Cup, n = 416), and (A = Adult, n = 6).

(ii) Compute the low-rank parameter p such that the Frobenius norm of A_p captures > 99% of the Frobenius norm of A. This yields p = 10 for A = Census, p = 2 for A = KDD-Cup, and p = 4 for A = Adult. 

(iii) Generate Gaussian noise E and scale it by 20 equally spaced values in [0, 1]. For each choice of (A, p), compute: 
-  The actual error of the rank-p approximations
-  The theoretical bound from Theorem 2.1
-  The Eckart-Young-Mirsky bound.
(iv) Plot the means of outputs with error bars.
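A minimal sketch of steps (ii)-(iii) for a single noise scale (the matrix here is a stand-in; whether the energy criterion uses the squared Frobenius norm, and whether the noise is kept symmetric, are our assumptions, and the Theorem 2.1 bound is paper-specific so it is not reproduced):

```python
import numpy as np

def choose_rank(A, frac=0.99):
    """Smallest p whose best rank-p approximation A_p captures > frac
    of the (squared) Frobenius norm of A."""
    s = np.linalg.svd(A, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(energy, frac, side="right") + 1)

def rank_p(M, p):
    """Best rank-p approximation of M via truncated SVD."""
    U, s, Vt = np.linalg.svd(M)
    return U[:, :p] @ np.diag(s[:p]) @ Vt[:p]

rng = np.random.default_rng()
A = np.cov(rng.standard_normal((10, 200)))       # stand-in covariance matrix
p = choose_rank(A)

alpha = 0.5                                      # one of the 20 scales in [0, 1]
E = alpha * rng.standard_normal(A.shape)
E = (E + E.T) / 2                                # assumption: noise kept symmetric

s = np.linalg.svd(A, compute_uv=False)
true_err = np.linalg.norm(rank_p(A + E, p) - rank_p(A, p), 2)  # spectral norm
eym_term = s[p] if p < len(s) else 0.0           # sigma_{p+1}: EYM error of A_p
```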

(*) Output: 
- The Gaussian script produces Figure 1 (Section 4).

- The Rademacher script produces Figure 2 (Section 4).

-------
2. Appendix B: Comparison of error metrics
We compare three error metrics for low-rank approximation:
- Spectral norm error: \(\|\tilde A_p - A_p\|\). 
- Frobenius norm error: \(\|\tilde A_p - A_p\|_F\).
- Change-in-error: \(\bigl|\|A - A_p\| - \|A - \tilde A_p\|\bigr|\).
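As a concrete reading of the three metrics (taking the spectral norm to be the operator 2-norm; the matrix and noise below are placeholders, not the paper's data):

```python
import numpy as np

def low_rank(M, p):
    """Best rank-p approximation via truncated SVD."""
    U, s, Vt = np.linalg.svd(M)
    return U[:, :p] @ np.diag(s[:p]) @ Vt[:p]

rng = np.random.default_rng()
A = np.cov(rng.standard_normal((8, 100)))   # placeholder for a real covariance matrix
p = 5
E = rng.standard_normal(A.shape)
E = (E + E.T) / 2                           # assumption: symmetric Gaussian noise

Ap, Ap_tilde = low_rank(A, p), low_rank(A + E, p)

spectral_err  = np.linalg.norm(Ap_tilde - Ap, 2)
frobenius_err = np.linalg.norm(Ap_tilde - Ap, "fro")
change_in_err = abs(np.linalg.norm(A - Ap, 2) - np.linalg.norm(A - Ap_tilde, 2))
```

Note that the spectral error never exceeds the Frobenius error, so the two panels are directly comparable on a shared scale.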
------------
(a) Real-world matrices
(*) File: "Comparison_of_error_metrics_real_matrix.py"

(*) Datasets:
-- The US Census dataset can be downloaded from the UCI repository at https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)

-- The KDD Cup dataset can be downloaded from the UCI repository at https://archive.ics.uci.edu/dataset/129/kdd+cup+1998+data.

(*) Overview: 
-- For each covariance matrix (A = Census or A = KDD-Cup), set E to Gaussian noise and p = 5. 
-- Run 20 Monte Carlo trials. 
-- Plot all three metrics with error bars. 

(*) Output: 
Produces the second and third panels of Figure 3 (Appendix B).

(b) Synthetic matrix:
(*) File: "Comparison_of_error_metrics_synthetic_matrix.py"

(*) Overview:

-- Generate a synthetic PSD matrix A of size n = 50 with eigenvalues \(\lambda_i = 0.8^i\). 
-- Set E = Gaussian and p = 5. 
-- Run 20 Monte Carlo trials. 
-- Plot all three metrics with error bars.
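One standard way to realize such a spectrum is to conjugate a diagonal matrix by a random orthogonal matrix (a sketch; whether the script uses exactly this construction is our assumption):

```python
import numpy as np

rng = np.random.default_rng()
n = 50
eigvals = 0.8 ** np.arange(1, n + 1)              # lambda_i = 0.8^i, i = 1..n
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal matrix
A = Q @ np.diag(eigvals) @ Q.T                    # PSD with the prescribed eigenvalues
```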

(*) Output:
Produces the first panel of Figure 3 (Appendix B).
--------------------------
3. Appendix C: Empirical evaluation beyond gap assumption
(*) File: "Perturbation_bound_beyond_gap_assumption.py".
(*) Dataset:
-- The Alon colon-cancer microarray dataset can be downloaded from http://microarray.princeton.edu/oncology/affydata/index.html. 
(*) Overview:
-- Compute the covariance matrix A (n = 2000) and the low-rank parameter p such that A_p captures > 95% of the Frobenius norm of A. This yields p = 9. 
-- Compute the p-th eigenvalue gap \(\delta_p\). 
-- Generate Gaussian noise \(E = \alpha N(0, I)\), with \(\alpha\) chosen over evenly spaced values such that \(\|E\|/\delta_p \in \{0.05, 0.1, \dots, 0.45, 0.5\}\). 
-- For each \(\alpha\), compute and print: (1) the true error; (2) the classical bound \(2(\|E\| + \sigma_{p+1})\); (3) our bound \(7\|E\|\cdot \frac{\lambda_p}{\delta_p}\); (4) the ratios \(\tfrac{\text{our bound}}{\text{true error}}\) and \(\tfrac{\text{our bound}}{\text{classical bound}}\).
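The per-noise-level computation can be sketched as follows, on a small stand-in PSD matrix rather than the n = 2000 microarray covariance (the bound formulas follow the items above; the stand-in spectrum and the symmetrization of E are our assumptions):

```python
import numpy as np

def low_rank(M, p):
    """Best rank-p approximation via truncated SVD."""
    U, s, Vt = np.linalg.svd(M)
    return U[:, :p] @ np.diag(s[:p]) @ Vt[:p]

rng = np.random.default_rng()
n, p = 40, 9
spectrum = np.sort(rng.random(n))[::-1] + 0.01     # stand-in decreasing spectrum
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(spectrum) @ Q.T                    # small PSD stand-in covariance

sigma = np.linalg.svd(A, compute_uv=False)
delta_p = sigma[p - 1] - sigma[p]                  # gap lambda_p - lambda_{p+1}

rows = []
for target in np.linspace(0.05, 0.5, 10):          # target values of ||E|| / delta_p
    E = rng.standard_normal((n, n))
    E = (E + E.T) / 2
    E *= target * delta_p / np.linalg.norm(E, 2)   # rescale so ||E|| = target * delta_p

    true_err  = np.linalg.norm(low_rank(A + E, p) - low_rank(A, p), 2)
    classical = 2 * (np.linalg.norm(E, 2) + sigma[p])              # 2(||E|| + sigma_{p+1})
    ours      = 7 * np.linalg.norm(E, 2) * sigma[p - 1] / delta_p  # 7 ||E|| lambda_p / delta_p
    rows.append((target, true_err, classical, ours,
                 ours / true_err, ours / classical))

for row in rows:
    print("  ".join(f"{x:.4f}" for x in row))
```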
(*) Output:
Produces Table 2 (Appendix C). 

Summary of outputs
| # | Output          | Script                                                              | Datasets                             | Approx. run-time |
|---|-----------------|---------------------------------------------------------------------|--------------------------------------|------------------|
| 1 | Fig. 1          | Perturbation_bounds_Gaussian_noise_of_low_rank_approximations.py    | 1990 US Census, 1998 KDD Cup, Adult  | < 11 min         |
| 2 | Fig. 2          | Perturbation_bounds_Rademacher_noise_of_low_rank_approximations.py  | 1990 US Census, 1998 KDD Cup, Adult  | < 11 min         |
| 3 | Fig. 3, panel 2 | Comparison_of_error_metrics_real_matrix.py                          | 1990 US Census                       | ~ 1 min          |
| 4 | Fig. 3, panel 3 | Comparison_of_error_metrics_real_matrix.py                          | 1998 KDD Cup                         | ~ 4 min          |
| 5 | Fig. 3, panel 1 | Comparison_of_error_metrics_synthetic_matrix.py                     | N/A                                  | ~ 4 s            |
| 6 | Table 2         | Perturbation_bound_beyond_gap_assumption.py                         | Alon colon-cancer microarray         | ~ 1 min          |
--------------------------------

Final remarks:
-- All scripts produce plots with error bars as shown in the paper.

-- For each experiment, low-rank approximations are computed via SVD using NumPy.

-- Random seeds are not fixed; results may vary slightly across runs.

-- The experiments are lightweight and run on standard CPU machines.

-- All datasets are in the public domain or released under open academic licenses.

